Abstract

This  paper  presents  a  methodology  for  improving  the  speed  of  high-speed  adders.  As  a  starting  point,  a 
previously  proposed  method,  called  “speculative  completion,”  is  used  in  which  fast-terminating  additions 
are  automatically  detected.  Unlike  the  previous  design,  the  method  proposed  in  this  paper  is  able  to  adapt 
dynamically to (1) application-specific behavior and (2) adder-specific behavior, resulting in a higher
detection  rate  of  fast  additions  and,  consequently,  a  faster  average-case  speed  for  addition.  Our  experimental 
results  show  detection  rates  of  over  99%,  and  adder  average-case  speed  improvements  of  up  to  14.8%. 
1 Introduction


The promise of asynchronous design is based partly on obtaining average-case performance from circuits.
Synchronous circuits must be designed for worst-case performance since the data must always be ready
when the clock arrives. Asynchronous circuits, on the other hand, can in principle indicate when the circuit
has completed. This promise is often unrealized in the case of datapath elements, as the methodology (e.g.,
dual-rail design [19]) used to generate the completion signal incurs significant overhead. Thus, even
asynchronous circuits are often designed for worst-case delay. Nowick et al. [13, 11] attacked this problem for
adders by designing an adder which can complete early, but does not incur the penalty of a full completion
detection implementation. We extend their work by developing a methodology which automatically creates
specialized early-completion adders. We also redesign the adder circuit such that a static-CMOS design may
be used.

As technology continues to scale, increasing numbers of transistors are available to the designer. Using
these transistors effectively presents many challenges such as design time, wire delay, and power
dissipation. Traditionally, the extra transistors have been used to increase cache sizes, add complicated logic to
datapaths, or add extra functionality. An alternative approach is to use the transistors to decrease design
time, reduce barriers to further scaling, and lower power. This approach, called spatial computation,
distributes the computation over space, i.e., instead of time-multiplexing function units for different parts of a
computation, it dedicates a function unit to each operation in a program [7, 18, 5]. This approach reduces
energy-delay, decreases reliance on global wires and global controllers, and eliminates many central
structures (e.g., register files). A direct result is that every adder in the program is implemented by a separate
piece of silicon. Thus, it opens the opportunity to specialize each function unit, in this case each adder, for
the context in which it is used.

In order to gain some of the benefit of average-case design, Nowick et al. [11] introduced a modification
of the Brent-Kung adder [1] which could detect some cases in which the computation completed before the
worst-case delay time. The idea is to calculate, in parallel with the addition, a condition which indicates
whether the last few levels of the adder are needed. This calculation requires less overhead than traditional
completion-detection approaches, but fails to catch all cases in which the adder is done earlier than the
worst-case time. They further improve their results by special-casing their design, by hand, for different
classes of adders (e.g., address arithmetic adders versus general-purpose adders).

In this paper we describe our methodology for automatically specializing function units. It is motivated
in part by our approach to spatial computation; we transform high-level language programs directly into
hardware. It is thus natural to use a profile-based approach to hardware generation. By profiling the
program we can determine the statistical nature of the inputs to each adder. We combine the profile data with
compiler passes which perform bit-width and constant-bit analyses [4] in order to specialize the adders in
two ways. First, the constant bits and bit-width of the adder are modified to suit the input set. Second, the
early-termination detection circuit is optimized so that it will miss the fewest cases in which the
adder is done early. Because the compiler has access to the distribution of inputs and it creates separate
adders for each add in the program, it can create adders specialized to their context without requiring any
“by-hand” design.

To validate our methodology, we analyze Mediabench [9], MiBench [8], and Spec [16] benchmark
programs. We explore several different algorithms for determining the detection circuit which depend on how
the information in the profiled data set is used. Experimental results indicate that, when the number of
literals used in early termination varies from 15 to 35, as many as 96-99% of the early additions are detected, as
compared with an average of 76-93% for the Nowick et al. detection network.

In Section 2 we present some background information including the basic adder used in [13], which is
the basis for our design, as well as an overview of “Spatial Computation.” Section 3 introduces our algorithm
for building early-termination networks. Section 4 describes the modifications we made to the adder design
which reduce the adverse impact of including an early termination path. A detailed evaluation is presented
in Section 5, and the paper concludes with a summary and directions for future work.

2  Background 

This section reviews some previous contributions which are useful in understanding our approach to
building speculative-completion adders. First, techniques for asynchronous completion detection are discussed.
Second, the basic Brent-Kung adder is described. Third, the original Nowick speculative completion adder
is reviewed. Finally, an overview of spatial computation is provided.

2.1  Completion  Detection 

Asynchronous  circuits  must  rely  on  a  signal  other  than  the  clock  to  indicate  the  availability  of  results.  The 
techniques  for  generating  this  signal  usually  fall  into  one  of  two  categories:  matched  delay  and  completion 
detection. 

A  matched  delay  [6,  2,  14]  approach  computes  the  worst-case  delay  of  the  functional  unit  and  inserts  a 
delay  element  which  matches  this  worst-case  delay  through  the  functional  unit.  The  delay  element  can  either 
be  an  inverter  chain,  or,  for  a  tighter  match,  a  replication  of  the  critical  path  through  the  functional  unit.  The 
main  advantage  of  this  method  is  that  widely  available  synchronous  implementations  for  functional  units  can 
be  used;  the  disadvantage  is  that  the  delay  of  the  functional  unit  is  data-independent  and  always  worst-case. 

A  completion  detection  method  [10,  15]  is  data-dependent.  It  uses  the  result  signals,  usually  encoded 
using  delay  insensitive  codes  [19],  to  detect  when  the  computation  is  done.  However,  its  main  disadvantage 
is  the  completion  detection  network  which  adds  overhead  to  the  delay  of  the  result  and  may  offset  the 
savings  of  detecting  true  average-case.  In  addition,  the  use  of  delay-insensitive  codes  requires  the  design  of 
non-standard  functional  units  and  vastly  increases  their  area. 

The methodology used in this paper falls in between these two methods. The method, called “speculative
completion,” was introduced by Nowick in [13, 11] and is described in detail below.

2.2  Brent-Kung  Adder 

In  [1],  Brent  and  Kung  proposed  a  fast,  parallel  carry-lookahead  adder.  This  adder  is  the  basis  for  both 
Nowick’s  speculative  completion  adder  and  the  adder  proposed  in  this  paper. 

Figure 3 shows the schematic of a 32-bit Brent-Kung adder. The adder consists of 7 levels of logic,
each with a simple CMOS implementation and a fanout of 2. In general, an n-bit adder is implemented
using $\log_2(n) + 2$ levels of logic. The adder is fast, amenable to regular layout, and, most important for our
work, it is based on a simple, repetitive structure, which can easily be generated automatically.

The Brent-Kung adder is implemented as follows. Level-0 produces all propagate (p) and generate (g)
signals. Level-6 produces the sum as an XOR of level-0 propagates and level-5 generates. The intermediate
levels compute the propagate and generate signals for increasing runs of bits: level-1 computes all 2-bit
P and G values, level-2 all 4-bit P's and G's, and so on. In general, the expressions for level $i$ and bit $j$
propagates and generates are: $P_j^i = P_j^{i-1} \cdot P_{j-k}^{i-1}$ and $G_j^i = G_j^{i-1} + P_j^{i-1} \cdot G_{j-k}^{i-1}$, where $k = 2^{i-1}$.
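
As a minimal sketch of the recurrences above (a functional model, not the gate-level design), the following Python code keeps one integer per signal vector and applies the level-by-level updates with shifts. The name `prefix_add` is hypothetical, and the sketch assumes that there is no carry-in and that the carry into bit j is the final generate signal of bit j-1.

```python
def prefix_add(a, b, width=32):
    """Add two unsigned integers using the level-by-level P/G recurrences."""
    mask = (1 << width) - 1
    p0 = (a ^ b) & mask          # level-0 propagates: p_j = a_j XOR b_j
    g = (a & b) & mask           # level-0 generates:  g_j = a_j AND b_j
    p = p0
    k = 1                        # k = 2**(i-1) at level i
    while k < width:
        # G_j^i = G_j^(i-1) + P_j^(i-1) * G_(j-k)^(i-1);
        # shifting left by k lines bit j-k up with bit j.
        g = (g | (p & (g << k))) & mask
        # P_j^i = P_j^(i-1) * P_(j-k)^(i-1)
        p = (p & (p << k)) & mask
        k *= 2
    # Assumption: the carry into bit j is the final generate of bit j-1.
    carries = (g << 1) & mask
    # Sum level: XOR of the level-0 propagates with the incoming carries.
    return p0 ^ carries
```

For a 32-bit operand the loop runs for k = 1, 2, 4, 8, 16, which together with the level-0 and sum levels gives the 7 levels mentioned in the text.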
2.3 Speculative Completion

Speculative completion [13, 11] is a method for designing asynchronous datapath components. This method
has the advantage of matched delay implementations, namely the use of single-rail synchronous functional units,
but, more in the spirit of completion detection implementations, the functional units have several associated
delays: one for the worst case, and one or more speculative delays for the cases where the result is ready
earlier. The functional unit is also augmented with a termination-logic network, which operates in parallel
with the functional unit, and speculatively selects between the several associated delays.

As an example of speculative completion, Nowick et al. have proposed a speculative adder [13,
11]. This adder is based on the Brent-Kung adder. The adder can either complete its computation “early”
(after only 4 levels of logic), or “late” (after all 6 levels of logic). They prove the following necessary
condition for “late” completion: late completion can only occur if there exists a run of 8 consecutive level-0
propagate signals.

The termination-logic network detects when an addition may complete late. Since an exact
implementation of late completion detection is prohibitive (the SOP has exactly 200 literals), the termination logic only
safely approximates the condition for late completion. Nowick has proposed three different
termination-logic networks for general 32-bit adders. They also propose several other networks which can be used for
specialized adders, for example, adders which are guaranteed to have one summand of only a few bits.
However, these termination-logic networks have fixed implementations, and cannot adapt to adders with a
non-uniform distribution of inputs.

2.4  Spatial  Computation 

“Spatial Computation” (SC) [5, 3] is a model of computation which is based on the translation of
high-level language programs directly into hardware structures. SC program implementations are completely
distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation
units.

A particular implementation of SC is ASH (Application-Specific Hardware) [5]. Under the
assumption that computation is cheaper than communication, ASH replicates computation units to simplify
interconnect, building a system which uses very simple, completely dedicated communication channels. As a
consequence, communication on the datapath never requires arbitration; the only arbitration required is for
accessing memory. ASH relies on very simple hardware primitives and uses no associative structures, no
multiported register files, no scheduling logic, no broadcast, and no clocks. As a result, ASH hardware is
fast and extremely power efficient.

CASH is a fully automated compiler which takes ANSI C as input and translates it into Pegasus [3], a
dataflow intermediate representation. CAB (CASH Asynchronous Back-End) [18] then takes the Pegasus
representations and synthesizes them into micropipelined implementations, where each stage communicates
using 4-phase bundled-data protocols. CASH performs a wealth of optimizations on the code, of which
bit-width and range analysis, and constant propagation are of direct consequence to this paper since they
allow us to create specialized adders.

To better understand the role of Spatial Computation in specializing adders, Figure 1 shows the Pegasus
representation of a simple sum-of-squares program that uses i as an induction variable and sum to
accumulate the sum of the squares of i. On the right is the program's Pegasus representation, which consists of
three hyperblocks [5]. Hyperblock 1 initializes sum and i to 0. Hyperblock 2 represents the loop; it contains
two MERGE nodes, one for each of the loop-carried values, sum and i. Hyperblock 3 is the function epilog,
containing just the RETURN. Back-edges within a hyperblock denote loop-carried values; in this example
there are two such edges in hyperblock 2. Back-edges always connect an ETA (loop exit) to a MERGE (loop
entry) node.

    int squares()
    {
        int i = 0,
            sum = 0;

        for (; i < 10; i++)
            sum += i * i;
        return sum;
    }

Figure 1: C program and its representation comprising three hyperblocks; each hyperblock is shown as a
numbered rectangle. The dotted lines represent predicate values. (This figure omits the token edges used for
memory synchronization.)

Figure 1 has two individual 32-bit adders. However, each of these adders is strictly specialized. The
adder for the induction variable always adds 1, whereas the adder for the accumulation of sum adds
more variable values. This information, easily obtained by analysis in CASH, helps us to specialize the
termination-logic network for each adder. For example, for the addition with constant ‘1’, this network can
be as small as a single literal ($p_j$): the upper 31 bits of constant ‘1’ are all zeros, which means that the only
time a late addition occurs is when there is a run of propagate bits originating in position 1 and ending in
position 7. Of course, for a more accurate detection, more literals should be used. Note, however, that they
need not be evenly distributed throughout the 32 bits; rather, they should be concentrated on the lower 8
bits.


3  Early  Termination  Detection 

The early-termination detection network must quickly determine if all the generate signals at a given level
are equal to the final result. If so, it is safe to terminate the addition early. The generate signal for bit $i$ at
level $n$, $G_i^n$, is the sum $G_i^{n-1} + P_i^{n-1} G_{i-k}^{n-1}$, where $k = 2^{n-1}$. Clearly, if $P_i^{n-1} = 0$ then
$G_i^n = G_i^{n-1}$. This is a sufficient, but not necessary, condition for early completion. $P_i^{n-1}$ is the product
of the $k$ level-0 propagates $p_{i-k+1} \ldots p_i$, so $P_i^{n-1} = 0$ if and only if $p_j = 0$ for some $j$ in that range. If all such products are
zero, then the propagate signals are all zero and the generate signals of successive levels remain unchanged.
Therefore, we can predict the early-termination behavior of an addition as a function of the level-0 propagate
bits, $p$, with the following function LATE, which evaluates to zero if and only if the addition can be early
terminated:

$$\mathit{LATE} = p_0 p_1 p_2 p_3 p_4 p_5 p_6 p_7 + p_1 p_2 p_3 p_4 p_5 p_6 p_7 p_8 + \cdots + p_{24} p_{25} p_{26} p_{27} p_{28} p_{29} p_{30} p_{31}$$
We refer to each term in LATE as a run and denote the run starting at bit $i$ as $r_i$. For example, the first term
is run $r_0$.
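
To make the role of LATE concrete, here is a small functional sketch, assuming a 32-bit adder and runs of 8 bits as in the text. The helper names `late` and `generates_after` are hypothetical. The second helper stops the generate recurrence after a chosen number of intermediate levels, which lets one check that the level-3 and level-5 generates agree whenever LATE is zero.

```python
def late(p, width=32, run_length=8):
    """Exact LATE: True if some run r_i of run_length consecutive propagates is all ones."""
    for i in range(width - run_length + 1):
        run_mask = ((1 << run_length) - 1) << i
        if (p & run_mask) == run_mask:
            return True
    return False


def generates_after(a, b, levels, width=32):
    """Generate vector after the given number of intermediate levels (0 = level-0 only)."""
    mask = (1 << width) - 1
    p = (a ^ b) & mask
    g = (a & b) & mask
    for level in range(1, levels + 1):
        k = 1 << (level - 1)                 # k = 2**(level-1)
        g = (g | (p & (g << k))) & mask      # generate recurrence
        p = (p & (p << k)) & mask            # propagate recurrence
    return g
```

In this model, `late(a ^ b)` being false implies `generates_after(a, b, 3) == generates_after(a, b, 5)`, since any carry chain longer than 8 bits needs 8 consecutive propagates.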

It is not feasible for the early-termination detection network to exactly compute LATE. Instead, a safe
approximation is used. The LATE function is safely approximated if all the runs, $r_0 \ldots r_{24}$, are covered.
Runs are covered by a function over $p$ if the function safely approximates the contribution of each of the
runs to LATE; that is, if the function will only return zero if the OR of the covered runs is zero (but the
function, being an approximation, may also return a one in this case). A safe approximation never incorrectly
identifies an addition as being early, but may incorrectly identify an addition as being late.

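Previous work [13] has used the following uniform safe approximations in early-termination networks:

• $L_{3\times5} = p_5 p_6 p_7 + p_{11} p_{12} p_{13} + p_{17} p_{18} p_{19} + p_{23} p_{24} p_{25} + p_{29} p_{30} p_{31}$

• $L_{4\times5} = p_4 p_5 p_6 p_7 + p_9 p_{10} p_{11} p_{12} + p_{14} p_{15} p_{16} p_{17} + p_{19} p_{20} p_{21} p_{22} + p_{24} p_{25} p_{26} p_{27}$

• $L_{5\times7} = p_3 p_4 p_5 p_6 p_7 + p_7 p_8 p_9 p_{10} p_{11} + p_{11} p_{12} p_{13} p_{14} p_{15} + p_{15} p_{16} p_{17} p_{18} p_{19} + p_{19} p_{20} p_{21} p_{22} p_{23} +
p_{23} p_{24} p_{25} p_{26} p_{27} + p_{27} p_{28} p_{29} p_{30} p_{31}$

As the number and size of terms grows, the accuracy of the approximation increases, at the cost of
increased circuit size and execution time.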
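
The three uniform approximations can be checked against the exact LATE function with a short sketch. The representation below (a function as a list of terms, a term as a tuple of propagate bit positions) and the helper names `evaluate`, `late`, and `detection_rate` are assumptions made for illustration only.

```python
# Each function is a sum of products; each product lists the propagate bits it ANDs together.
L_3x5 = [(5, 6, 7), (11, 12, 13), (17, 18, 19), (23, 24, 25), (29, 30, 31)]
L_4x5 = [(4, 5, 6, 7), (9, 10, 11, 12), (14, 15, 16, 17),
         (19, 20, 21, 22), (24, 25, 26, 27)]
L_5x7 = [tuple(range(start, start + 5)) for start in range(3, 28, 4)]


def evaluate(func, p):
    """Return 1 if any term of func has all of its propagate bits set in p, else 0."""
    return int(any(all((p >> bit) & 1 for bit in term) for term in func))


def late(p, width=32, run_length=8):
    """Exact LATE: True if p contains run_length consecutive one bits."""
    full = (1 << run_length) - 1
    return any(((p >> i) & full) == full for i in range(width - run_length + 1))


def detection_rate(func, samples):
    """Fraction of truly early additions that func also reports as early."""
    early = [p for p in samples if not late(p)]
    detected = sum(1 for p in early if evaluate(func, p) == 0)
    return detected / len(early)
```

A safe approximation must return 1 whenever `late(p)` is true. The converse can fail: for `p = 0xE0` only bits 5 to 7 are set, so the addition is early, yet `L_3x5` reports it as late.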

3.1  Using  Application  Specific  Information 

Here we describe how we exploit application-specific information to design early-terminating adders.
Instead of implementing a uniform function in the early-termination
network, a custom function is used that, given the runtime behavior of the application, more accurately
approximates the LATE function while not exceeding timing and space constraints.

We use dynamic programming to generate the custom approximation function. The dynamic
programming algorithm builds up a solution by finding the function $F_{ij}$ with the best score $B_{ij}$ which covers the runs
$r_i \ldots r_{last}$ using no more than $j$ literals. The score of a function is determined by a probability predictor
which, based on profile information, returns the probability that the function returns zero.

A safe approximation of LATE can be built up from literals and terms. A literal $p_i$ covers a run $r_j$ if $p_i$
is contained within $r_j$. For example, $p_1$ covers both the runs $r_0$ and $r_1$. If $p_i = 0$, $r_j$ must be zero, whereas
if $p_i = 1$ we can safely approximate $r_j$ as being one. A run is covered by a term if it is in the intersection of
the runs covered by the literals of the term. For example, the term $p_1 p_8$ covers only the run $r_1$. A disjunction
of terms covers the union of the runs covered by the terms of the disjunction. For example, $p_7 + p_{15}$ covers
the runs $r_0 \ldots r_{15}$.
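
The covering rules can be written down directly as set operations. The sketch below assumes 32 propagate bits and runs of 8 bits (so runs $r_0$ to $r_{24}$); the function names are hypothetical, and terms are represented as tuples of bit positions.

```python
RUN_LENGTH = 8
LAST_RUN = 24   # runs r_0 .. r_24 for a 32-bit adder


def runs_covered_by_literal(bit):
    """Runs r_j (starting at bit j, RUN_LENGTH bits long) that contain the given bit."""
    first = max(0, bit - RUN_LENGTH + 1)
    last = min(bit, LAST_RUN)
    return set(range(first, last + 1))


def runs_covered_by_term(term):
    """A term covers the intersection of the runs covered by its literals."""
    covered = set(range(LAST_RUN + 1))
    for bit in term:
        covered &= runs_covered_by_literal(bit)
    return covered


def runs_covered_by_function(func):
    """A disjunction of terms covers the union of the runs covered by its terms."""
    covered = set()
    for term in func:
        covered |= runs_covered_by_term(term)
    return covered
```

The three examples from the text then read `runs_covered_by_literal(1)`, `runs_covered_by_term((1, 8))`, and `runs_covered_by_function([(7,), (15,)])`.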

Our algorithm for building up a termination approximation function from profile data is given in
Figure 2. The algorithm builds up a safe approximation function by greedily adding a term that covers an
additional run to a previously computed function such that the probability that the resulting function returns
zero is maximized. The algorithm is not optimal since the term and bit probabilities are not independent.
If no independence assumptions were made, an optimal algorithm would have to consider an exponential
number of functions. The running time of the algorithm is $O(N L \Pi R 2^R)$, where $N$ is the number of bits, $L$
is the maximum number of literals desired, $\Pi$ is the running time of the probability predictor, and $R$ is the
number of bits in a run. As $N$, $L$, and $R$ are typically small constants, the running time of the algorithm is
dominated by the probability predictor.




for i = MAX_RUN to 0 do
    for l = 1 to L do
        best_prob = -1
        best_func = 0
        for all t such that t covers r_i do
            l' = l - num_literals(t)
            for all j > i s.t. ∀k ≥ i, r_k is covered by F[j][l'] + t do
                if P(F[j][l'] + t) > best_prob then
                    best_prob = P(F[j][l'] + t)
                    best_func = F[j][l'] + t
                end if
            end for
        end for
        F[i][l] = best_func
    end for
end for

Figure 2: Dynamic programming algorithm for building a function with at most L literals that safely
approximates the LATE function and maximizes the probability that the function returns zero (early) as
determined by a probability predictor P, modulo the (false) assumption that the probability of a term being
zero is independent of the value of other terms. For clarity the boundary condition checking is omitted.
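
The following is one possible executable reading of the algorithm in Figure 2, offered as a sketch rather than as the authors' implementation. Several details are assumptions: the boundary condition (an empty function past the last run), the dictionary `F` keyed by `(i, l)`, the enumeration of terms as subsets of the bits inside run $r_i$, and the simple independence-based predictor `make_bits_predictor`. All names are hypothetical.

```python
from itertools import combinations


def make_bits_predictor(one_freq):
    """P(func == 0) from per-bit frequencies, assuming independent bits and terms."""
    def prob_zero(func):
        prob = 1.0
        for term in func:
            term_is_one = 1.0
            for bit in term:
                term_is_one *= one_freq[bit]
            prob *= 1.0 - term_is_one
        return prob
    return prob_zero


def build_approximation(n_bits, run_len, max_literals, prob_zero):
    """F[(i, l)]: best function found covering runs r_i..r_last with at most l literals."""
    last_run = n_bits - run_len
    # Assumed boundary: past the last run nothing remains to cover.
    F = {(last_run + 1, l): () for l in range(max_literals + 1)}
    for i in range(last_run, -1, -1):
        for l in range(1, max_literals + 1):
            best_prob, best_func = -1.0, None
            for size in range(1, min(l, run_len) + 1):
                # A term covers r_i exactly when all of its literals lie inside r_i.
                for term in combinations(range(i, i + run_len), size):
                    # The term covers r_i .. r_min(term), so the rest of the
                    # function may start at any run j with i < j <= min(term) + 1.
                    for j in range(i + 1, min(term[0], last_run) + 2):
                        rest = F.get((j, l - size))
                        if rest is None:
                            continue  # r_j..r_last not coverable with l - size literals
                        candidate = (term,) + rest
                        prob = prob_zero(candidate)
                        if prob > best_prob:
                            best_prob, best_func = prob, candidate
            if best_func is not None:
                F[(i, l)] = best_func
    return F.get((0, max_literals))


def covers_all_runs(func, n_bits, run_len):
    """Safety check: every run must fully contain at least one term of func."""
    return all(any(all(i <= bit < i + run_len for bit in term) for term in func)
               for i in range(n_bits - run_len + 1))
```

For example, `build_approximation(16, 4, 6, make_bits_predictor([0.5] * 16))` builds a function for a toy 16-bit adder with 4-bit runs. The 32-bit, 8-bit-run case enumerates up to 255 terms per run, so a pure-Python run is noticeably slower, and the function returns `None` when the literal budget is too small to cover every run.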

3.2  Probability  Predictors 

Both the execution time and result quality of our termination approximation function generator depend
heavily on the probability predictor. The probability predictor uses profile data to assign probabilities that
a given function of $p$ is zero. The profile data consists of a trace of addition inputs for some execution of
a program. A perfectly accurate probability predictor would evaluate the function on every addition in the
trace and return the percentage that evaluated to zero. Such a predictor would run in time proportional to the
size of the trace file and so would not be feasible in practice. Instead, it is necessary to summarize the data
and extract an approximate probability from the summary. We consider three different probability predictors
with different trade-offs between accuracy and running time:

Bits Predictor This predictor summarizes the data using bit frequencies. The probability of a function is
determined by assuming bit independence. The summary of the profile data can be collected and
stored in $O(N)$ space, but the predictor makes strong independence assumptions that affect its
accuracy. The running time of this predictor is $O(NR)$.

Terms Predictor This predictor summarizes the data using term frequencies. The probability of each term
at every bit position is stored. The probability of a function is determined by assuming independence
between term probabilities. The summary of the profile data can be collected and stored in $O(N 2^R)$
space, but the predictor makes weaker independence assumptions than the bits predictor. The running
time of this predictor is $O(N)$.

Sampling Predictor This predictor summarizes the data by randomly selecting a subset of the data. The
probability of a function is determined by evaluating the function on all the members of the sample.
This predictor requires an arbitrary amount of memory depending on the sample size. The accuracy
of the predictor is expected to increase with the sample size, but the running time is $O(NRS)$, where
$S$ is the sample size.

Figure 3: Brent-Kung adder
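
As an illustration of the trade-off between predictors, the sketch below implements an exact (full-trace) predictor, a bits predictor, and a sampling predictor; the terms predictor is left out for brevity. The trace format (a list of level-0 propagate vectors `p = a ^ b`), the function names, and the fixed seed are assumptions, not details taken from the paper.

```python
import random


def evaluate(func, p):
    """Return 1 if any term (tuple of bit positions) of func is all ones in p, else 0."""
    return int(any(all((p >> bit) & 1 for bit in term) for term in func))


def exact_probability(func, trace):
    """Fraction of trace entries on which func evaluates to zero."""
    return sum(1 for p in trace if evaluate(func, p) == 0) / len(trace)


def bit_frequencies(trace, n_bits=32):
    """O(N) summary: how often each propagate bit is one in the trace."""
    return [sum((p >> bit) & 1 for p in trace) / len(trace) for bit in range(n_bits)]


def bits_predictor(func, freq):
    """Estimate P(func == 0) assuming all bits (and hence terms) are independent."""
    prob = 1.0
    for term in func:
        term_is_one = 1.0
        for bit in term:
            term_is_one *= freq[bit]
        prob *= 1.0 - term_is_one
    return prob


def draw_sample(trace, size, seed=0):
    """Summary for the sampling predictor: a random subset of the trace."""
    return random.Random(seed).sample(trace, size)


def sampling_predictor(func, sample):
    """Estimate P(func == 0) by evaluating func on every member of the sample."""
    return exact_probability(func, sample)
```

On the two-entry trace `[0b000, 0b110]`, bits 1 and 2 are perfectly correlated: the term `(1, 2)` is zero half of the time, while the bits predictor estimates 0.75 because it treats the two bits as independent.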


4  Adder  Architecture 

The previous two sections have described the basic Nowick speculative completion adder and our
algorithm to dynamically construct termination-logic functions. This section describes the modifications to the
original Nowick adder. First, we describe the changes in sum generation which render static-CMOS
implementations speed-effective. Second, we briefly present several issues involved in the technology mapping
of termination-logic networks.

4.1  New  Adder  Implementation 

The original speculative adder in [13] was built using dynamic logic and manually implemented at transistor
level. Although still correct, a static-CMOS implementation of that design has the main disadvantage of introducing
significant delays in the sum generation level, resulting in an average-case latency that is greater than the latency of the
unmodified Brent-Kung adder. In our system, CAB synthesizes circuits using standard gates from vendor
libraries, so dynamic logic implementations were not available and a better solution than the original was
required.

Two small, yet crucial, modifications of the original speculative adder design yield speculative adders
useful for standard-gate implementations. The first modification addresses the sum generation logic; the
second one modifies the implementation of intermediate “generate” cells.
Figure 4: Modification for sum generation at level-6.


Figure  5:  Generate  function  at  level  three. 


4.1.1  Sum  Generation 

The original sum generation logic [13] (see Figure 3) increases the delay of the last level in the
Brent-Kung adder from one XOR gate to four gates, including three XOR gates and one complex AOI gate. As
recognized by Nowick, this unacceptably increases the average-case delay of the adder, which may now
exceed the delay of the unmodified adder. Therefore, to render a speculative adder useful in standard-gate
implementations, a better solution is introduced here.

The sum generation cell for a 32-bit adder implements $s_j = p_j \oplus (\mathit{early} \cdot G^3 + \overline{\mathit{early}} \cdot G^5)$,
where $G^3$ and $G^5$ denote the level-3 and level-5 generate signals feeding bit $j$. First, it has
to select between the generate signals from level-3 (for early completion) or level-5 (for late completion).
Second, it needs to XOR the selected generate signal with the propagate from the first level of logic. A
direct implementation would introduce a MUX gate on the critical path, and would require broadcasting the
result of the termination logic to all sum generation cells.

Our new sum implementation takes into account the nature of the generate signals. The generate function
for each level is $G_j^i = G_j^{i-1} + P_j^{i-1} \cdot G_{j-k}^{i-1}$, where $i$ is the adder level, $j$ is the bit position, and $k = 2^{i-1}$.
This function is monotonically increasing in the value of $G_j^{i-1}$. Therefore, assuming the initial values of $G^3$
and $G^5$ are zero, the selection of the proper generate signal in the sum generation is simplified to an OR
gate. Figure 4 shows the new implementation for the sum generation; since level-3 and level-5 generate
signals are negated, the OR gate becomes a NAND gate.
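
The monotonicity argument can be checked on settled signal values with a short functional sketch. It does not model timing, glitches, or the reset behavior; it only shows that every bit set in the level-3 generates is also set in the level-5 generates, so OR-ing the two yields the level-5 value. The names `generates_after` and `sum_with_or` are hypothetical, and the carry into bit j is assumed to be the generate of bit j-1.

```python
def generates_after(a, b, levels, width=32):
    """Generate vector after the given number of intermediate levels."""
    mask = (1 << width) - 1
    p = (a ^ b) & mask
    g = (a & b) & mask
    for level in range(1, levels + 1):
        k = 1 << (level - 1)
        g = (g | (p & (g << k))) & mask   # G can only gain ones from level to level
        p = (p & (p << k)) & mask
    return g


def sum_with_or(a, b, width=32):
    """Sum computed from (G3 OR G5) instead of a MUX between G3 and G5."""
    mask = (1 << width) - 1
    g3 = generates_after(a, b, 3, width)
    g5 = generates_after(a, b, 5, width)
    # Assumption: the carry into bit j is the selected generate of bit j-1.
    carries = ((g3 | g5) << 1) & mask
    return ((a ^ b) & mask) ^ carries
```

In the early case the two generate vectors are identical, so the OR output is already final once level-3 has settled; in the late case the OR output settles to the level-5 value.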

4.1.2  Partial  Reset  Logic 

The new sum implementation assumes that $G^3$ and $G^5$ are zero at the start of the computation. However,
this may not be the case, and the adder needs to be reset before performing a new addition. Such a solution
may be unacceptable because of the performance penalties it introduces: extra gates need
to be placed either on the inputs or on the intermediate-level gates.

Our solution to this problem is to reset the adder partially, i.e., reset only selected signals inside the
adder. The sum generation uses only level-3 and level-5 generate signals, so these are obvious candidates.
However, a better solution is possible. Since the generate functions are monotonically increasing, it is
sufficient to reset only the level-3 generates; after two gate delays, the level-5 generates also become zero.

The implementation of the new level-3 generate function is shown in Figure 5. It has only one extra
input, a reset signal, which is active-0: 1 during computation and 0 during reset.

The new adder introduces a timing constraint on the environment: the delay between two consecutive
additions has to be greater than the delay of the level-4 and level-5 generate cells combined. In effect, the adder
now works in fundamental mode [17, 12]. However, it is very unlikely that this constraint will be violated,
since the adder is assumed to work in designs communicating with 4-phase handshaking: the return-to-zero
phase (which includes the partial adder reset) is at least the worst-case delay of the adder.

Figure 6: Architecture of our early terminating adder.

The  main  drawback  of  the  new  adder  is  increased  power  consumption  since  the  level-6  sum  bits  now 
potentially  perform  twice  the  number  of  transitions  of  the  original  adder  during  each  cycle  of  computation. 

4.1.3  Adder  Architecture 

Figure  6  shows  the  new  architecture  of  the  adder.  At  this  level,  the  only  difference  from  the  original 
speculative-completion  adder  is  the  introduction  of  a  new  RESET  signal  which  is  used  as  described  above. 
Since  it  is  assumed  that  the  adder  is  used  in  modules  which  communicate  with  4-phase  handshaking,  the 
RESET signal can simply be connected to the “req” input: when “req” = 1, the adder is ready to perform its
computation; when “req” = 0, the adder is partially reset for the next computation.

4.2  Technology  Mapping  for  Termination  Logic 

The termination logic for speculative adders has to obey two constraints. First, the delay of the termination
logic has to meet a timing constraint to avoid hazards. Second, its output has to be glitch-free after a certain
delay.

The output of the termination-logic network selects between the early delay and the late delay; the output
of the selection is the “done” signal for the entire adder, which has to be glitch-free. The matching delays
are built as chains of inverters, so they are hazard-free. The two constraints on the termination logic are
therefore as follows. First, it has to compute faster than the early delay ($\delta_{TL} < \delta_{early}$). Second, the
termination logic has to maintain a glitch-free output from $\delta_{early}$ after the start of the computation onward.

The termination logic is technology-mapped by our compiler. The input is a simple 2-level logic function,
and the output is structural Verilog. The compiler applies simple transformations to the logic function
(factoring, De Morgan’s laws), producing a fast and small implementation. However, the actual delays of the
adder and of the termination logic are not known until later, when commercial CAD tools estimate the actual
delays after place and route. Therefore, our solution is to characterize these delays in the back-end, and
check for the timing constraint $\delta_{tl} < \delta_{early}$. If the condition holds, then the adder is correct; otherwise, the
adder is flagged, and the compiler is run again to generate this particular adder as a standard, non-speculative
adder.
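
The back-end check described above can be sketched as follows. This is only an illustration of the decision, not the compiler's actual code: the function name `choose_adder_style` and its arguments are hypothetical, and the delays are assumed to be post-place-and-route estimates in nanoseconds.

```python
def choose_adder_style(delay_term_logic_ns: float, delay_early_ns: float) -> str:
    """Decide which adder variant to keep after timing characterization.

    Hypothetical helper: the speculative adder is kept only when the
    termination logic settles strictly before the early matched delay.
    """
    if delay_term_logic_ns < delay_early_ns:
        return "speculative"
    # Constraint violated: the adder is flagged and would be regenerated
    # as a standard, non-speculative adder.
    return "standard"


# Example values chosen for illustration only.
print(choose_adder_style(0.2, 0.56))
print(choose_adder_style(0.6, 0.56))
```

The comparison is strict, so an adder whose termination logic exactly matches the early delay is treated as failing the constraint.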
5  Results


We evaluate our application-specific early-terminating adders in the context of spatial computation and
Application Specific Hardware (ASH). The optimizing CASH compiler compiles standard C programs into
asynchronous circuits represented in structural Verilog. The gate-level Verilog generated is synthesized with
Synopsys Design Compiler 2002.05-SP2, placed-and-routed with Silicon Ensemble 5.3, and simulated with
Modeltech ModelSim SE 5.7.

In the context of our work, the important feature of the spatial computation model is that, unlike a
monolithic processor, there is not just a single adder. Instead, every static add in a program translates into an adder
in the final circuit. This feature allows us not only to create an application-specific custom adder, but to
create many application-context-specific adders. Previous work [13] considered address computations, branch
offsets, and ordinary arithmetic as separate cases; with spatial computation we can take this specialization
to the extreme and optimize each adder based not only on the type of the add, but also on the program context.

A functional simulator of ASH was used to collect traces of additions and subtractions. The complete
traces were then summarized as appropriate for each probability predictor, and the termination-logic function
was generated for each individual adder as well as for a single monolithic adder. The generated functions
were then provided as input to the compiler, which generates speculative adders.*

5.1  Early-Termination  Detection 

Although our modified adder completes faster than an unmodified one if it terminates early, it is slower than
the unmodified adder when it completes late. Thus, in order for early termination to be an effective strategy, there
must be a significant percentage of adds which can be detected as being early. In our 32-bit implementation,
an early add completes in 0.56 ns, but a late add completes in 0.73 ns, compared to 0.66 ns for a normal adder
(see Table 1). This means that more than 41.17% of the additions must be early, or the average execution
time of the adder will be worse than that of the standard adder. As shown in Figure 7, early adds are common
enough that implementing early termination is beneficial. For subtractions, only 8 out of 31 benchmarks
would benefit from using an early-termination mechanism; to make subtractions effective, more
application-specific information must be used. This is a topic for future research.
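
The break-even point quoted above follows from equating the expected delay of the speculative adder with the delay of the standard adder. The sketch below reproduces that arithmetic; the function names are illustrative, and only the three delays (0.56 ns, 0.73 ns, and 0.66 ns) are taken from the text.

```python
def expected_delay(p_early: float, t_early: float, t_late: float) -> float:
    """Average delay when a fraction p_early of additions terminate early."""
    return p_early * t_early + (1.0 - p_early) * t_late


def break_even_fraction(t_early: float, t_late: float, t_standard: float) -> float:
    """Fraction of early additions at which the speculative adder's
    average delay equals the standard adder's delay."""
    return (t_late - t_standard) / (t_late - t_early)


# Delays in ns for the 32-bit implementation, as quoted in the text.
p = break_even_fraction(t_early=0.56, t_late=0.73, t_standard=0.66)
print(f"break-even early fraction: {p:.4f}")
print(f"average delay at break-even: {expected_delay(p, 0.56, 0.73):.2f} ns")
```

The fraction works out to 0.07/0.17, or roughly 41%, which agrees with the threshold stated above; any larger share of early additions gives an average delay below 0.66 ns.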

The accuracy of our lateness approximation functions is shown in Figure 8. We compare functions with
at most 35 literals derived using different probability predictors as well as the monolithic versus spatial
models. As expected, the spatial model using the terms predictor does the best, with an average error of less
than one percent. Using application-specific information is useful even under the monolithic model, with the
bits and terms predictors averaging 5.17% and 4.54% error versus 6.85% for the uniform function.

The advantage of using application-specific information is greatest when the number of literals is
constrained. A termination-logic network that takes 35 literals as input is still fast enough to successfully
trigger early termination (Figure 10). However, using fewer literals results in a smaller circuit and reduced
power consumption. As shown in Figure 9, the combination of optimizing for application-specific behavior
and the spatial computing model results in lateness approximation functions which, using only 15 literals,
have better accuracy than a uniform function of 35 literals.

Even when a maximum of 35 literals is allowed in the lateness approximation functions, the full benefit
of the function can be obtained by using fewer literals. The distribution of function sizes is shown in
Figure 11. This distribution is bimodal. Adders where only the lower bits are typically affected, such as

*We are currently integrating the speculative adder generation with the rest of the CAD synthesis path. We will have the
simulations of Mediabench kernels within one month.
Figure 7: The percent of all additions in each benchmark that support early termination as determined by
LATE. The threshold above which it becomes beneficial to use an early-terminating adder (the overhead
of the implementation is compensated for by the performance gains of the early additions) is shown as a
dashed line. All but one benchmark exceed this threshold.


those  involved  in  address  computations  and  incrementing,  need  fewer  than  the  maximum  number  of  literals 
to  detect  their  behavior  while  adders  involved  in  more  complicated  arithmetic  can  take  full  advantage  of  all 
available  literals. 

One possible disadvantage of using a profile-driven optimization is that the result of the optimization,
in this case the early termination detection functions, may be overly specific to the profiled data and not be
representative of the general behavior of the program. To demonstrate this effect, the accuracy of the early
termination detection functions was evaluated on programs from Mediabench^ when run using different
inputs and options than the original profiled execution (Figure 12). In many cases it is clear from the
significant decrease in the accuracy of early termination detection that the functions are overly specific to
the profiled input sets. These results clearly indicate that additional profile data should be used to create the
best termination networks for the application. For example, the error of the terms-based early termination
detection functions for the spatial computation model is over 60% on the benchmark epic_d when only
the initial profile data is used, but when the combined profile data of both sets of inputs is used to create the
function, the error drops to 1.57%, although the error on the original input set increases from 0.71% to 2.07%.
If this new function is then evaluated on another, unprofiled, execution trace, the error of this function is

^The  other  benchmarks  either  did  not  have  alternative  inputs  or  did  not  have  alternative  inputs  where  the  addition  trace  was  a 
manageable  size. 
Figure 8: The percent error of various lateness approximation functions relative to the LATE function.
As  expected,  the  application  specific  functions  do  better  than  the  uniform  function,  but  are  eclipsed  by  the 
individual  adders,  where  each  static  add  receives  its  own  custom-tailored  early  termination  network.  All 
functions  were  restricted  to  containing  no  more  than  35  literals. 


4.93%  which  is  comparable  to  Nowick’s  function.  However,  the  early  termination  detection  functions  for 
the  monolithic  model  perform  significantly  better  (2.35%  for  bits  and  2.08%  for  terms). 

5.2  Adder  Performance 

The performance of the dynamically generated adders is investigated in this subsection. First, the
termination-logic network is characterized. Then, the dynamically generated adders from the gsm_d benchmark are
characterized for speed, area, and power.

The  delay  of  the  termination  logic  must  be  less  than  the  early  delay  through  the  adder.  Figure  10  shows  a 
graph  of  the  delays  through  termination-logic  functions,  as  well  as  the  delay  threshold  for  early  addition  for 
a  32-bit  adder.  The  termination-logic  functions  were  randomly  generated,  and  the  maximum  delay  through 
each  is  shown  here.  Notice  that  for  termination  functions  of  up  to  50  literals,  the  timing  constraint  on  the 
termination  logic  is  met.  In  practice,  however,  the  number  of  literals  in  the  termination  logic  is  much  less 
than  this  threshold  number,  and  the  delays  through  the  termination  logic  are  much  smaller  than  the  early 
completion  delays  (see  Table  1). 

In order to characterize the performance of the dynamically generated speculative adders, the gsm_d
benchmark was used as an example. This benchmark generates 7 different adders (Table 1). Of these, two
(112 and 173) are general-purpose adders, while the others are specialized adders in which an operand is a
Figure 9: The average percent error for various lateness approximation functions with different constraints
on the number of literals. As expected, as the number of literals increases, the error decreases. The use
of application-specific information has the largest benefit when the size of the function is most constrained.
The application-specific spatial computing models outperform the 35-literal uniform function using 20 fewer
literals.


ADD ID   Bitwidth   Const   Delay Early (ns)   Delay Late (ns)   Delay Term Logic (ns)
112      32         N       0.56               0.73              0.2
131      32         Y       0.47               0.63              0.1
15       32         Y       0.47               0.63              0.1
173      17         N       0.55               0.68              0.2
387      32         Y       0.39               0.59              0.1
551      18         Y       0.46               0.59              0.1
55       32         Y       0.47               0.63              0.1

Brent-Kung  0.66 ns

Table 1: The performance of “gsm_d” adders.

constant  (see  column  Const).  All  but  two  (173  and  551)  are  32-bit  adders.  The  matching  delays  for  the  early 
and  late  completion  for  each  individual  adder  are  shown  in  columns  Delay  Early  and  Delay  Late,  while  the 
delay  of  each  termination-logic  network  is  shown  in  the  last  column.  For  comparison  purposes,  the  delay  of 
the  standard  Brent-Kung  adder  is  listed  in  the  last  table  entry. 

There are two important conclusions. First, the latency of the termination logic is much less than that
of the early delay for the adder, which means that the adders are correct. Second, notice that the delay
of each adder varies with the size of the inputs, as well as with whether one of the inputs is a constant. The
Figure 10: The delays for technology-mapped, randomly generated termination-logic functions. Each point
is the maximum delay across 20 randomly generated functions. The “maximum allowable delay” is the
delay required such that the level-3 generate signal can be used in the sum computation.


adder-generation  algorithm  performs  some  very  simple  optimizations  when  one  of  the  inputs  is  a  constant; 
however,  these  optimizations  are  limited  only  to  the  first  level  of  logic  in  the  adder. 

Table 2 shows the area of each gsm_d adder, both the standard version and the speculative completion
one.  The  average  area  overhead  is  15%.  Notice  that,  even  if  two  speculative  adders  have  the  same  bitwidth, 
the  area  may  be  different:  it  is  influenced  by  the  size  of  the  termination-logic  function  and  by  whether  the 
adder  is  a  constant  adder.  The  last  entry  in  the  table  indicates  the  area  for  the  Tax 5  speculative  adder  in  [13]. 
With  all  the  transistor  resources  now  available  on  chips,  we  believe  that  the  area  increase  is  a  good  tradeoff 
with  speed. 
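
As a check on the average quoted above, the per-adder overheads can be recomputed from the areas listed in Table 2. The sketch below assumes that the overhead is the relative increase over the standard adder and that the average is the plain mean of the per-adder overheads; both assumptions are inferred from the table rather than stated in the text, and the names used are illustrative.

```python
# (standard area, speculative area) per adder ID, copied from Table 2.
AREAS = {
    112: (112.65, 135.17),
    131: (91.67, 102.73),
    15: (91.67, 102.24),
    173: (56.73, 71.56),
    387: (91.67, 97.16),
    551: (48.66, 57.26),
    55: (91.67, 102.73),
}


def overhead_percent(standard: float, speculative: float) -> float:
    """Relative area increase of the speculative adder, in percent."""
    return 100.0 * (speculative - standard) / standard


overheads = {
    adder_id: overhead_percent(std, spec)
    for adder_id, (std, spec) in AREAS.items()
}
average_overhead = sum(overheads.values()) / len(overheads)

for adder_id, pct in overheads.items():
    print(f"adder {adder_id}: {pct:.2f}%")
print(f"average overhead: {average_overhead:.2f}%")
```

Under these assumptions the mean comes out at about 15%, in line with the figure quoted above.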

Table 3 shows the power consumption for each adder in the gsm_d benchmark, both the standard
Brent-Kung adder and the speculative adder. On average, the speculative adders consume 16% more power
than  the  non-speculative  one.  In  comparison,  the  power  consumption  overhead  for  Nowick’s  speculative 
completion  adder  with  the  Tsxs  is  lower  (12%). 

6  Conclusion 

This paper presents a new method for improving the speed of additions by using application-specific
information. It builds on “speculative completion” [13], but our experiments show that it results in better
early-detection rates than the original design because it uses detailed knowledge of addition data profiles.
Figure 11: The distribution of the size of lateness approximation functions across all adders of all the
benchmarks. The unweighted distribution counts each adder equally, whereas the weighted distribution weights
each adder by its execution frequency. Adders which do not demonstrate detectable early-termination
behavior frequently enough to benefit from early termination fall in the n/a category.


ADD ID   Standard Adder (μ²)   Speculative Adder (μ²)   Area Overhead
112      112.65                135.17                   20.00%
131      91.67                 102.73                   12.06%
15       91.67                 102.24                   11.53%
173      56.73                 71.56                    26.14%
387      91.67                 97.16                    5.99%
551      48.66                 57.26                    17.68%
55       91.67                 102.73                   12.06%
Avg:     83.53                 95.55                    15.07%
Nowick   112.65                135.94                   17.13%

Table 2: Area for gsm_d adders: the standard versions and the speculative versions.
Figure 12: The percent error of various lateness approximation functions when evaluated on program
executions using different inputs than the profiled execution. A maximum of 35 literals per function is allowed.
In some cases, the approximation functions remain effective, but in others the error dramatically increases
compared to that of Nowick.


ADD ID         Standard Adder (mW)   Speculative Adder (mW)   Power Overhead
112            4.91                  5.52                     12.42%
131            4.83                  5.34                     10.56%
15             4.34                  5.47                     26.04%
173            3.35                  4.01                     19.70%
387            4.78                  5.45                     14.02%
551            3.48                  4.13                     18.68%
55             4.87                  5.45                     11.91%
Avg:           4.37                  5.05                     16.19%
Nowick 4-lit   4.91                  5.52                     12.42%

Table 3: Power consumption for gsm_d adders: the standard versions and the speculative versions.
At the same time, it produces termination-logic networks which have fewer literals than the original and
are thus faster and smaller. Profiling techniques are powerful, but the data sets must be large and complete
enough for these techniques to be effective.

A second contribution of our paper is a modified Brent-Kung adder implementation which is more
efficient for speculative completion techniques. This adder can be easily implemented with standard gates,
and is amenable to automatic generation by CAD tools.

In summary, this paper demonstrates the power of combining asynchronous circuits and ASH. By using
high-level information (e.g., bit-width analysis and profiling information) one can optimize low-level circuits
(e.g., early termination networks), resulting in improved overall performance.